2024-10-31
The dm500.Rds data contain four important variables plus a subject ID on 500 adults with diabetes.
We want to predict the subject’s current Hemoglobin A1c level (a1c), using (up to) three predictors:
- a1c_old: subject’s Hemoglobin A1c (in %) two years ago
- age: subject’s age in years (between 30 and 70)
- income: median income of subject’s neighborhood (3 levels)

a1c is our outcome, which we’ll predict using three models…

- a1c_old alone to predict a1c
- a1c_old and age together to predict a1c
- a1c_old, age, and income together to predict a1c

The dm500 data

# A tibble: 500 × 5
a1c a1c_old age income subject
<dbl> <dbl> <dbl> <fct> <chr>
1 6.3 11.4 62 Higher_than_50K S-001
2 11 16.3 54 Between_30-50K S-002
3 8.7 10.7 47 <NA> S-003
4 6.5 5.8 53 Below_30K S-004
5 6.7 6.3 64 Between_30-50K S-005
6 5.8 6.5 48 Below_30K S-006
7 9.6 NA 49 Below_30K S-007
8 6.1 7.2 63 Between_30-50K S-008
9 12.9 7.7 55 Below_30K S-009
10 6.7 6.8 63 Below_30K S-010
# ℹ 490 more rows
Today, we’ll assume all missing values are Missing at Random (MAR), create 10 imputations, and use only the 7th.
set.seed(20241031)
dm500_tenimps <- mice(dm500, m = 10, printFlag = FALSE)
dm500_i <- complete(dm500_tenimps, 7) |> as_tibble()
n_miss(dm500)

[1] 24

n_miss(dm500_i)

[1] 0

The dm500_i tibble

select(dm500_i, -subject) (500 rows and 4 variables, 4 shown)
ID | Name | Type | Missings | Values | N
---+---------+-------------+----------+-----------------+------------
1 | a1c | numeric | 0 (0.0%) | [4.3, 16.7] | 500
---+---------+-------------+----------+-----------------+------------
2 | a1c_old | numeric | 0 (0.0%) | [4.2, 16.3] | 500
---+---------+-------------+----------+-----------------+------------
3 | age | numeric | 0 (0.0%) | [31, 70] | 500
---+---------+-------------+----------+-----------------+------------
4 | income | categorical | 0 (0.0%) | Higher_than_50K | 124 (24.8%)
| | | | Between_30-50K | 198 (39.6%)
| | | | Below_30K | 178 (35.6%)
---------------------------------------------------------------------
Our goal is accurate prediction of a1c values. Suppose we have decided to consider these three possible models…

- a1c_old alone to predict a1c
- a1c_old and age together to predict a1c
- a1c_old, age, and income together to predict a1c

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. (A. Einstein)
Entities should not be multiplied without necessity. (Occam’s razor)
On Parsimony: Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
On Worrying Selectively: Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.
… all models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind.
- Select a random sample of dm500_i (60-80% is common) for model training, creating dm500_i_train.
- Use anti_join() to pull the subjects not in dm500_i_train into a testing sample.

from Posit’s Data Transformation Cheat Sheet
“Mutating Joins” join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.
- left_join(x, y): Join matching values from y to x.
- right_join(x, y): Join matching values from x to y.
- inner_join(x, y): Join data. Retain only rows with matches.
- full_join(x, y): Join data. Retain all values, all rows.

from Posit’s Data Transformation Cheat Sheet
“Filtering Joins” filter one table against the rows of another.
- semi_join(x, y): Return rows of x that have a match in y. Use to see what will be included in a join.
- anti_join(x, y): Return rows of x that do not have a match in y. Use to see what will not be included in a join.

Use by = join_by(col1, col2, ...) to specify one or more common columns to match on.
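A minimal sketch of such a split. A toy 10-subject tibble stands in for dm500_i here, and the 70% proportion and seed are illustrative choices, not the ones used in class:

```r
# Sketch of a training/testing split with slice_sample() and anti_join().
# The toy tibble, proportion, and seed are illustrative assumptions.
library(dplyr)

dm <- tibble(subject = sprintf("S-%03d", 1:10),
             a1c = c(6.3, 11, 8.7, 6.5, 6.7, 5.8, 9.6, 6.1, 12.9, 6.7))

set.seed(431)
train <- dm |> slice_sample(prop = 0.70)               # 70% for training
test  <- anti_join(dm, train, by = join_by(subject))   # everyone else

nrow(train)  # 7
nrow(test)   # 3
```

With the real data, sampling 70% of the 500 subjects yields the 350-subject training sample used in the model summaries below.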
a1c (outcome)

p1 <- ggplot(dm500_i_train, aes(x = a1c)) +
geom_histogram(binwidth = 0.5,
fill = "slateblue", col = "white")
p2 <- ggplot(dm500_i_train, aes(sample = a1c)) +
geom_qq(col = "slateblue") + geom_qq_line(col = "violetred") +
labs(y = "Observed a1c", x = "Normal (0,1) quantiles") +
theme(aspect.ratio = 1)
p3 <- ggplot(dm500_i_train, aes(x = "", y = a1c)) +
geom_violin(fill = "slateblue", alpha = 0.1) +
geom_boxplot(fill = "slateblue", width = 0.3, notch = TRUE,
outlier.color = "slateblue", outlier.size = 3) +
labs(x = "") + coord_flip()
p1 + p2 - p3 +
plot_layout(ncol = 1, heights = c(3, 2)) +
plot_annotation(title = "Hemoglobin A1c values (%)",
subtitle = str_glue("Model Development Sample: ", nrow(dm500_i_train),
" adults with diabetes"))

a1c (outcome)

We want to identify a good transformation for the conditional distribution of the outcome, given the predictors, in an attempt to make the linear regression assumptions of linearity, Normality, and constant variance more appropriate.
| Transformation | \(y^2\) | \(y\) | \(\sqrt{y}\) | \(\log(y)\) | \(1/y\) | \(1/y^2\) |
|---|---|---|---|---|---|---|
| \(\lambda\) | 2 | 1 | 0.5 | 0 | -1 | -2 |
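To see the ladder in action, we can apply each rung to a right-skewed toy sample and watch a simple symmetry measure, (mean − median)/sd, shrink toward 0 at the appropriate rung. This base-R sketch uses an illustrative lognormal sample, not the a1c data:

```r
# Walking the ladder of power transformations on toy right-skewed data.
set.seed(1)
y <- rlnorm(500, meanlog = 2, sdlog = 0.4)   # illustrative right-skewed sample

ladder <- list(`y^2` = y^2, `y` = y, `sqrt(y)` = sqrt(y),
               `log(y)` = log(y), `1/y` = 1/y, `1/y^2` = 1/y^2)
skew <- sapply(ladder, function(t) (mean(t) - median(t)) / sd(t))
round(skew, 3)   # values near 0 indicate a roughly symmetric transform
```

For these toy data the log rung is the natural fix; for a1c, the boxCox machinery below suggests going one rung further, to 1/y.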
p1 <- ggplot(dm500_i_train, aes(x = log(a1c))) +
geom_histogram(bins = 15,
fill = "royalblue", col = "white")
p2 <- ggplot(dm500_i_train, aes(sample = log(a1c))) +
geom_qq(col = "royalblue") + geom_qq_line(col = "magenta") +
labs(y = "Observed log(a1c)", x = "Normal (0,1) quantiles") +
theme(aspect.ratio = 1)
p3 <- ggplot(dm500_i_train, aes(x = "", y = log(a1c))) +
geom_violin(fill = "royalblue", alpha = 0.1) +
geom_boxplot(fill = "royalblue", width = 0.3, notch = TRUE,
outlier.color = "royalblue", outlier.size = 3) +
labs(x = "", y = "Natural log of Hemoglobin A1c") + coord_flip()
p1 + p2 - p3 +
plot_layout(ncol = 1, heights = c(3, 2)) +
plot_annotation(title = "Natural Logarithm of Hemoglobin A1c",
subtitle = str_glue("Model Development Sample: ", nrow(dm500_i_train),
" adults with diabetes"))

bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -1.0208 -1 -1.396 -0.6456
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 29.14439 1 6.718e-08
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 116.6815 1 < 2.22e-16
p1 <- ggplot(dm500_i_train, aes(x = (1/a1c))) +
geom_histogram(bins = 15,
fill = "forestgreen", col = "white")
p2 <- ggplot(dm500_i_train, aes(sample = (1/a1c))) +
geom_qq(col = "forestgreen") + geom_qq_line(col = "tomato") +
labs(y = "Observed 1/a1c", x = "Normal (0,1) quantiles") +
theme(aspect.ratio = 1)
p3 <- ggplot(dm500_i_train, aes(x = "", y = (1/a1c))) +
geom_violin(fill = "forestgreen", alpha = 0.1) +
geom_boxplot(fill = "forestgreen", width = 0.3, notch = TRUE,
outlier.color = "forestgreen", outlier.size = 3) +
labs(x = "", y = "1/Hemoglobin A1c") + coord_flip()
p1 + p2 - p3 +
plot_layout(ncol = 1, heights = c(3, 2)) +
plot_annotation(title = "Inverse of Hemoglobin A1c",
subtitle = str_glue("Model Development Sample: ", nrow(dm500_i_train),
" adults with diabetes"))

# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 350 0 0.134 0.0291 0.137 0.0286 0.0599 0.115 0.154 0.233
# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 350 0 13.4 2.91 13.7 2.86 5.99 11.5 15.4 23.3
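Note that the second table is exactly 100 times the first: since 100/a1c = 100 × (1/a1c), every summary shown (mean, SD, median, MAD, min, quartiles, max) scales by the same constant. A quick base-R check on toy values:

```r
# Multiplying a variable by a constant scales these summaries by that
# constant, so summarizing 100/a1c or 1/a1c carries the same information.
set.seed(2)
x <- runif(350, 4.3, 16.7)            # toy a1c-like values (illustrative)
c(mean(100 / x), 100 * mean(1 / x))   # identical
c(sd(100 / x), 100 * sd(1 / x))       # identical
```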
temp <- dm500_i_train |>
mutate(transa1c = 100/a1c) |>
select(a1c_old, age, income, transa1c)
correlation(temp)

# Correlation Matrix (pearson-method)
Parameter1 | Parameter2 | r | 95% CI | t(348) | p
---------------------------------------------------------------------
a1c_old | age | -0.19 | [-0.29, -0.09] | -3.59 | < .001***
a1c_old | transa1c | -0.61 | [-0.67, -0.54] | -14.34 | < .001***
age | transa1c | 0.19 | [ 0.09, 0.29] | 3.60 | < .001***
p-value adjustment method: Holm (1979)
Observations: 350
ggpairs() comes from the GGally package.

dm500_i_train <- dm500_i_train |> mutate(transa1c = 100/a1c)
fit1 <- lm(transa1c ~ a1c_old, data = dm500_i_train)
fit2 <- lm(transa1c ~ a1c_old + age, data = dm500_i_train)
fit3 <- lm(transa1c ~ a1c_old + age + income,
data = dm500_i_train)
c(n_obs(fit1), n_obs(fit2), n_obs(fit3))

[1] 350 350 350

n_obs() gives us the number of observations used to fit the model. It comes from the insight package in easystats.

model_parameters(fit1)

Parameter | Coefficient | SE | 95% CI | t(348) | p
-------------------------------------------------------------------
(Intercept) | 20.97 | 0.54 | [19.90, 22.04] | 38.49 | < .001
a1c old | -0.98 | 0.07 | [-1.12, -0.85] | -14.34 | < .001
fit1 equation (1/6)

\[ \hat{\frac{100}{A1c}} = 20.97 - 0.98 \mbox{ A1c_old} \]

Our fit1 model estimates the value of (100/A1c) to be 20.97 if A1c_old = 0, but we have no values of A1c_old anywhere near 0 in our data, nor is such a value plausible clinically, so the intercept doesn’t tell us much here.

fit1 equation (2/6)

\[ \hat{\frac{100}{A1c}} = 20.97 - 0.98 \mbox{ A1c_old} \]
Our fit1 model estimates the slope of A1c_old to be -0.98, with 95% CI (-1.12, -0.85). Interpret the point estimate…
Suppose Harry has an A1c_old value that is one percentage point (A1c is measured in percentage points) higher than Sally’s. Our fit1 model predicts the 100/A1c value for Harry to be 0.98 less than that of Sally, on average.

fit1 equation (3/6)

\[ \hat{\frac{100}{A1c}} = 20.97 - 0.98 \mbox{ A1c_old} \]
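Since fit1 predicts on the 100/A1c scale, a prediction must be inverted to return to the A1c (%) scale. A base-R sketch using the coefficients above, for a hypothetical subject with A1c_old = 8 (a value chosen only for illustration):

```r
# Back-transforming a fit1 prediction from the 100/A1c scale to A1c (%).
b0 <- 20.97; b1 <- -0.98              # coefficients from the fit1 equation
a1c_old <- 8                          # hypothetical prior A1c of 8%
pred_trans <- b0 + b1 * a1c_old       # predicted 100/A1c = 13.13
pred_a1c <- 100 / pred_trans          # back-transformed prediction
round(pred_a1c, 2)                    # 7.62
```

In practice, predict(fit1, newdata = ...) would give pred_trans directly, and the same division by the result recovers the A1c scale.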
Our fit1 model estimates the slope of A1c_old to be -0.98, with 95% CI (-1.12, -0.85). What can we say about the CI?
fit1 equation (4/6)

Slope of A1c_old in fit1 is -0.98, with 95% CI (-1.12, -0.85).
- fit1 estimates a slope of -0.98 in study participants; extending that estimate to a population depends on the model fit1 being correct.

fit1 equation (5/6)

Slope of A1c_old in fit1 is -0.98, with 95% CI (-1.12, -0.85).

Our 95% confidence interval describes the data as compatible with population slopes of A1c_old between -1.12 and -0.85, assuming the participants are representative of the population of interest, and assuming the underlying linear model fit1 is correct.

fit1 equation (6/6)

The interval is not a statement that the population slope falls between -1.12 and -0.85 with 95% probability. To find such a probability interval, we’d need to combine our confidence interval (giving compatibility of data with population slope values) with meaningful prior information on which values for the population slope are plausible.
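With a fitted object in hand, base R’s confint() produces this kind of 95% interval directly. A self-contained sketch on simulated data (the simulation, with a true slope of -1, is purely illustrative):

```r
# confint() for a regression slope, on toy data with a known true slope.
set.seed(3)
d <- data.frame(x = rnorm(100))
d$y <- 5 - 1 * d$x + rnorm(100, sd = 0.5)   # true slope is -1
fit <- lm(y ~ x, data = d)
confint(fit, "x", level = 0.95)             # 95% CI for the slope of x
```

The interval is always centered at the point estimate, so the estimate necessarily falls inside its own CI; whether the true slope does is a matter of coverage, not certainty.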
model_performance(fit1)

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1583.106 | 1583.176 | 1594.680 | 0.371 | 0.370 | 2.303 | 2.309
The compare_performance() function weights AIC and BIC measures so that higher values indicate a better fit.

model_parameters(fit2)

Parameter | Coefficient | SE | 95% CI | t(347) | p
-------------------------------------------------------------------
(Intercept) | 19.42 | 1.02 | [17.42, 21.43] | 19.06 | < .001
a1c old | -0.96 | 0.07 | [-1.10, -0.82] | -13.79 | < .001
age | 0.02 | 0.01 | [ 0.00, 0.05] | 1.79 | 0.074
\[ \hat{\frac{100}{A1c}} = 19.42 - 0.96 \mbox{ A1c_old} + 0.02 \mbox{ Age} \]
Suppose Harry is one year older than Sally, but they have the same A1c_old. On average, our fit2 model predicts Harry’s 100/A1c value to be 0.02 higher than Sally’s.

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1583.106 | 1583.176 | 1594.680 | 0.371 | 0.370 | 2.303 | 2.309
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1581.884 | 1582.000 | 1597.316 | 0.377 | 0.374 | 2.292 | 2.302
- fit2 has higher \(R^2\), higher adjusted \(R^2\), lower RMSE, lower Sigma, and lower values of AIC and corrected AIC.
- fit1 has a lower value of BIC.

Remember: compare_performance() weights AIC and BIC so higher values indicate better fit.
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
---------------------------------------------------------------------------------------------------------------
fit2 | lm | 0.377 | 0.374 | 2.292 | 2.302 | 0.648 | 0.643 | 0.211 | 85.71%
fit1 | lm | 0.371 | 0.370 | 2.303 | 2.309 | 0.352 | 0.357 | 0.789 | 14.29%
- fit2 has higher \(R^2\), higher adjusted \(R^2\), lower RMSE, lower Sigma, and higher (weighted) AIC and (weighted) corrected AIC.
- fit1 has higher (weighted) BIC.

fit1 vs. fit2 performance

model_parameters(fit3)

Parameter | Coefficient | SE | 95% CI | t(345) | p
-------------------------------------------------------------------------------
(Intercept) | 19.49 | 1.07 | [17.39, 21.59] | 18.24 | < .001
a1c old | -0.95 | 0.07 | [-1.09, -0.81] | -13.61 | < .001
age | 0.02 | 0.01 | [ 0.00, 0.05] | 1.69 | 0.092
income [Between_30-50K] | 0.05 | 0.32 | [-0.58, 0.69] | 0.16 | 0.873
income [Below_30K] | -0.21 | 0.33 | [-0.87, 0.44] | -0.64 | 0.522
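The two income rows above are indicator (dummy) variables: lm() expands the 3-level factor into two 0/1 columns, with the first level (here Higher_than_50K) as the reference. A base-R illustration with model.matrix(), using one subject from each level:

```r
# How a 3-level factor becomes two indicator columns in the design matrix.
income <- factor(c("Higher_than_50K", "Between_30-50K", "Below_30K"),
                 levels = c("Higher_than_50K", "Between_30-50K", "Below_30K"))
model.matrix(~ income)   # intercept + one indicator per non-reference level
```

Both indicators are 0 for the reference level, so the income coefficients in the table are each level’s estimated difference from Higher_than_50K.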
\[ \hat{\frac{100}{A1c}} = 19.49 - 0.95 \mbox{ A1c_old} + 0.02 \mbox{ Age} \\ + 0.05 \mbox{(Inc 30-50)} - 0.21 \mbox{(Inc<30)} \]

Suppose Harry and Sally are the same age with the same A1c_old, but Harry lives in a neighborhood with income < $30K, while Sally lives in a neighborhood with income > $50K. On average, our fit3 model predicts Harry’s 100/A1c value to be 0.21 lower than Sally’s.

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1583.106 | 1583.176 | 1594.680 | 0.371 | 0.370 | 2.303 | 2.309
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1581.884 | 1582.000 | 1597.316 | 0.377 | 0.374 | 2.292 | 2.302
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------
1584.938 | 1585.183 | 1608.086 | 0.379 | 0.372 | 2.289 | 2.306
Remember compare_performance() weights AIC and BIC so higher values indicate better fit.
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
---------------------------------------------------------------------------------------------------------------
fit2 | lm | 0.377 | 0.374 | 2.292 | 2.302 | 0.568 | 0.568 | 0.211 | 83.06%
fit3 | lm | 0.379 | 0.372 | 2.289 | 2.306 | 0.123 | 0.116 | 9.67e-04 | 43.26%
fit1 | lm | 0.371 | 0.370 | 2.303 | 2.309 | 0.308 | 0.316 | 0.788 | 26.54%
| Model | \(R^2\) | Adjusted \(R^2\) | Predictors |
|---|---|---|---|
| fit1 | 0.371 | 0.370 | a1c_old |
| fit2 | 0.377 | 0.374 | a1c_old, age |
| fit3 | 0.379 | 0.372 | a1c_old, age, income |

- The largest model (fit3) always looks best by raw \(R^2\) (raw \(R^2\) is greedy).
- By adjusted \(R^2\), fit2 looks best.

| Model | RMSE | Sigma | Predictors |
|---|---|---|---|
| fit1 | 2.303 | 2.309 | a1c_old |
| fit2 | 2.292 | 2.302 | a1c_old, age |
| fit3 | 2.289 | 2.306 | a1c_old, age, income |

- fit3 looks best by RMSE.
- fit2 looks best by Sigma (\(\sigma\)).

Unweighted versions, from model_performance()…
| Model | AIC | AIC_c | BIC | Predictors |
|---|---|---|---|---|
| fit1 | 1583.106 | 1583.176 | 1594.680 | a1c_old |
| fit2 | 1581.884 | 1582.000 | 1597.316 | + age |
| fit3 | 1584.938 | 1585.183 | 1608.086 | + income |

- fit2 looks best by AIC and corrected AIC.
- fit1 looks best by BIC.

WEIGHTED versions, from compare_performance()…
| Model | AIC (wtd) | AIC_c (wtd) | BIC (wtd) | Predictors |
|---|---|---|---|---|
| fit1 | 0.308 | 0.316 | 0.788 | a1c_old |
| fit2 | 0.568 | 0.568 | 0.211 | + age |
| fit3 | 0.123 | 0.116 | 9.7e-04 | + income |
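These weighted values can be reproduced by hand from the unweighted AICs: each model gets weight exp(−Δ/2) / Σ exp(−Δ/2), where Δ is its AIC minus the minimum AIC across the models. A base-R check against the table:

```r
# Recomputing Akaike weights from the unweighted AIC values above.
aic   <- c(fit1 = 1583.106, fit2 = 1581.884, fit3 = 1584.938)
delta <- aic - min(aic)                        # AIC differences from the best model
w     <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(w, 3)   # fit1 0.308, fit2 0.568, fit3 0.123 -- matching the weighted table
```

The same calculation with the BIC column reproduces the BIC weights; the weights sum to 1 and can be read as relative support for each model among those considered.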
- fit2 looks best by AIC and corrected AIC.
- fit1 looks best by BIC.

check_model() in the training sample

Session Information

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
abind_1.4-8 askpass_1.2.1 backports_1.5.0
base64enc_0.1.3 bayestestR_0.15.0 bit_4.5.0
bit64_4.5.2 blob_1.2.4 boot_1.3-31
broom_1.0.7 bslib_0.8.0 cachem_1.1.0
callr_3.7.6 car_3.1-3 carData_3.0-5
cellranger_1.1.0 cli_3.6.3 clipr_0.8.0
coda_0.19-4.1 codetools_0.2-20 colorspace_2.1-1
compiler_4.4.1 conflicted_1.2.0 correlation_0.8.6
cowplot_1.1.3 cpp11_0.5.0 crayon_1.5.3
curl_5.2.3 data.table_1.16.2 datasets_4.4.1
datawizard_0.13.0 DBI_1.2.3 dbplyr_2.5.0
Deriv_4.1.6 digest_0.6.37 doBy_4.6.24
dplyr_1.1.4 dtplyr_1.3.1 easystats_0.7.3
effectsize_0.8.9 emmeans_1.10.5 estimability_1.5.1
evaluate_1.0.1 fansi_1.0.6 farver_2.1.2
fastmap_1.2.0 fontawesome_0.5.2 forcats_1.0.0
foreach_1.5.2 Formula_1.2-5 fs_1.6.5
gargle_1.5.2 generics_0.1.3 GGally_2.2.1
ggplot2_3.5.1 ggstats_0.7.0 glmnet_4.1-8
glue_1.8.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.1 grDevices_4.4.1 grid_4.4.1
gridExtra_2.3 gtable_0.3.6 haven_2.5.4
highr_0.11 hms_1.1.3 htmltools_0.5.8.1
httr_1.4.7 ids_1.0.1 insight_0.20.5
isoband_0.2.7 iterators_1.0.14 janitor_2.2.0
jomo_2.7-6 jquerylib_0.1.4 jsonlite_1.8.9
knitr_1.49 labeling_0.4.3 lattice_0.22-6
lifecycle_1.0.4 lme4_1.1-35.5 lubridate_1.9.3
magrittr_2.0.3 MASS_7.3-61 Matrix_1.7-0
MatrixModels_0.5.3 memoise_2.0.1 methods_4.4.1
mgcv_1.9.1 mice_3.16.0 microbenchmark_1.5.0
mime_0.12 minqa_1.2.8 mitml_0.4-5
modelbased_0.8.9 modelr_0.1.11 multcomp_1.4-26
munsell_0.5.1 mvtnorm_1.3-1 naniar_1.1.0
nlme_3.1-164 nloptr_2.1.1 nnet_7.3-19
norm_1.0.11.1 numDeriv_2016.8.1.1 openssl_2.2.2
ordinal_2023.12.4.1 pan_1.9 parallel_4.4.1
parameters_0.23.0 patchwork_1.3.0 pbkrtest_0.5.3
performance_0.12.4 pillar_1.9.0 pkgconfig_2.0.3
plyr_1.8.9 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.8.1 purrr_1.0.2
quantreg_5.99 R6_2.5.1 ragg_1.3.3
rappdirs_0.3.3 RColorBrewer_1.1-3 Rcpp_1.0.13
RcppEigen_0.3.4.0.2 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 report_0.5.9
reprex_2.1.1 rlang_1.1.4 rmarkdown_2.29
rpart_4.1.23 rstudioapi_0.17.1 rvest_1.0.4
sandwich_3.1-1 sass_0.4.9 scales_1.3.0
see_0.9.0 selectr_0.4.2 shape_1.4.6.1
snakecase_0.11.1 SparseM_1.84.2 splines_4.4.1
stats_4.4.1 stringi_1.8.4 stringr_1.5.1
survival_3.7-0 sys_3.4.3 systemfonts_1.1.0
textshaping_0.4.0 TH.data_1.1-2 tibble_3.2.1
tidyr_1.3.1 tidyselect_1.2.1 tidyverse_2.0.0
timechange_0.3.0 tinytex_0.54 tools_4.4.1
tzdb_0.4.0 ucminf_1.2.2 UpSetR_1.4.0
utf8_1.2.4 utils_4.4.1 uuid_1.2.1
vctrs_0.6.5 viridis_0.6.5 viridisLite_0.4.2
visdat_0.6.0 vroom_1.6.5 withr_3.0.2
xfun_0.48 xml2_1.3.6 xtable_1.8-4
yaml_2.3.10 zoo_1.8-12
431 Class 19 | 2024-10-31 | https://thomaselove.github.io/431-2024/